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Abstract 

Background: The covariation of two sites in a protein is often used as the degree of their coevolution. To quantify 
the covariation many methods have been developed and most of them are based on residues position-specific 
frequencies by using the mutual information (Ml) model. 

Results: In the paper, we proposed several new measures to incorporate new biological constraints in quantifying 
the covariation. The first measure is the mutual information with the amino acid background distribution (MIB), 
which incorporates the amino acid background distribution into the marginal distribution of the Ml model. The 
modification is made to remove the effect of amino acid evolutionary pressure in measuring covariation. The 
second measure is the mutual information of residues physicochemical properties (MIP), which is used to measure 
the covariation of physicochemical properties of two sites. The third measure called MIBP is proposed by applying 
residues physicochemical properties into the MIB model. Moreover, scores of our new measures are applied to a 
robust indicator conn(k) in finding the covariation signal of each site. 

Conclusions: We find that incorporating amino acid background distribution is effective in removing the effect of 
evolutionary pressure of amino acids. Thus the MIB measure describes more biological background information for 
the coevolution of residues. Besides, our analysis also reveals that the covariation of physicochemical properties is a 
new aspect of coevolution information. 



Background 

A protein's function depends on its three-dimensional 
(3D) structure and interactions of residues [1]. When 
there is a mutation of functionally or structurally impor- 
tant residues, compensatory mutations may occur to 
preserve or restore the function or structure of the pro- 
tein. Thus the knowledge of residues coevolution helps 
to predict protein function and guide experimental ana- 
lysis. To quantify the coevolution of a protein chain, a 
multiple sequence alignment (MSA) of the chain and its 
homologous sequences are generated at first. Then the 
covariation of two sites is used as the degree of their 
coevolution. 

In order to quantify the covariation of two sites in a 
given MSA, many computational methods have been 
developed in recent years. These methods can be 
divided into two groups: parametric methods and 
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nonparametric ones [2-14]. The parametric methods 
incorporate maximum likelihood approximations [15], 
Bayesian probabilities [16], Phylogentic approaches [17] 
and so on. While nonparametric methods are more 
extensive and most of them are based on mutual infor- 
mation (MI). The MI model is taken from information 
theory [18], and uses the position specific distribution of 
two sites [19-21]. For example, [6] tested the effect of 
the size of MSAs and the mutation rate on two sources 
of background (finite sample size effects and phyloge- 
netic influence). They also tested the performance of 
various normalizations of MI in enhancing the detection 
of coevolving sites. To integrate different biological con- 
straints with the MI model, some methods focused on 
how to relate observation counts to expected distribu- 
tion [22-24]. However, [25] pointed out that none of the 
estimates are more 'correct' than others since all the 
methods merely depend on assumptions. And it has 
been shown that the estimates of MI are more affected 
by these assumptions than by the actually observed data. 
The 'correct' conditional assumptions refer to the 
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assumptions strictly matching known biological con- 
straints. Another widely used nonparametric method is 
ELSC which applies a perturbation-based method to cal- 
culate explicitly the likelihood of evolutionary covariance 
in MS As [26]. Although many biological constraints 
have been used in measuring covariation, the amino 
acid background distribution and their physicochemical 
properties are ignored in previous methods. 

In order to improve covariation measures, two new 
biological constraints are introduced in the paper. The 
first constraint is the amino acid background distribu- 
tion which indicates the evolutionary pressures of amino 
acids. It has been proved that if a residue is under rela- 
tively low evolutionary pressure, it will be readily 
replaced in the evolutionary history. Thus this residue 
will have relatively low frequency in an MSA and vice 
versa. In general, the frequency of a residue can be used 
as an estimate of its evolutionary pressure and is often 
called the background frequency of the residue [27]. A 
residue background distribution is called 'no pressure' if 
it can describe the distribution of amino acids subject to 
no evolutionary pressure [28]. Actually, it depicts evolu- 
tionary pressures for each residue exactly. However, it is 
difficult to estimate the 'no pressure' background distri- 
bution. Thus statistical background usually provides an 
alternative to approximate it. As suggested by several 
existing conservation measures, the BLOSUM62 back- 
ground distribution [29] is used as the amino acid back- 
ground distribution in the paper. By incorporating 
amino acid background distribution into account, we 
proposed a new method MI with amino acid back- 
ground distribution (MIB). 

The second constraint to improve covariation mea- 
sures is the physicochemical properties which are 
important in predicting functional important residues 
[30-36]. In order to incorporate the physicochemical 
properties into the MI model, amino acids are often 
grouped into six disjoint groups or ten overlapping 
groups in previous works [37]. In the study, the classifi- 
cation of ten overlapping groups of amino acids is cho- 
sen since six disjoint groups has a deficiency that 
residues of different types are treated equally different. 
Then, a new method called the MI with physicochem- 
ical properties (MIP) is developed to estimate the varia- 
tion of physicochemical properties of two given sites. 

The third model is called MIBP method which is used 
to estimate the covariation of physicochemical proper- 
ties by removing their background distributions. More- 
over, our measures are applied to a robust indicator 
conn(k) [23] in finding covariation signal of each site. 

Methods 

Since the reliability of MI values depends on that of 
MSA, the quality of MSA is important in measuring the 



covariation. In the paper, MSAs are downloaded from 
the PFAM [38] data base. And the number of sequences 
in an alignment should be greater than 125 [6]. Given 
an MSA, sequences in it are clustered at 90% sequence 
identity and the redundant sequences are removed. 
Moreover, the columns with more than 25% gaps are 
also removed as suggested by [23]. Gaps are also 
ignored when position specific frequencies are calcu- 
lated. Moreover, these proposed methods are also tested 
on a recently published data set which was created by 
Capra and Singh (2007) [28]. MSAs in the data set with 
lower than 125 sequences are removed. After filtering, 
496 MSAs remained and results on them can be found 
in the supplement material. Throughout the paper, we 
use N to denote the number of sequences in the MSA 
and c_pair (K,L) to denote the pair of column K and 
column L used to calculate the MI based measures. 

The Ml model with the amino acid background 
distribution (MIB) 

An ordering is first specified to the 20 amino acids. 
Given an MSA, the amino acid position specific fre- 
quency of a column K for the «th residue is calculated 
as: 



count(K') 
N ' 



where count(/<') denotes the number of the ith (i = 1, 
20) residue in column K. 

Similarly, the joint probability distribution of c_pair 
(K,L) for a residue pair consisting of the /th and /'th resi- 
due is defined as: 



p{K\V) 



counX{K\V) 



where count(/'C', V ) denotes the number of rows, in 
which the residues in column K and L are the ith. and 
/'th residue, respectively. 

Based on the above definitions, the classical MI [18] is 
calculated as: 



20 20 



Clearly, if pilC , V ) = Q, then the value of the mono- 

mial pfK^'.L'lloe ■ — '■ — is 0. Moreover, if one col- 

umn in a column-pair is fully conserved, then M I {K, 
L) = 0. As suggested by [6], we further normalize M I 
{K, L) by H {K, L), and define 



MI'[K,L) 



MI[K, L) 
H(K, L) ' 



Gao et al. BMC Bioinformatics 201 1, 12:206 
http://www.biomedcentral.eom/1 471 -2 1 05/1 2/206 



Page 3 of 1 3 



where H{K, L) = ' logP(^'' ^^)- 

In order to account for the background distribution of 
amino acids, we introduce the marginal distribution, 
which is defined as: 



20 



where q is the BLOSUM62 amino acid background 
distribution. Then a new estimation of covariation called 
the MI with the amino acid background distribution 
(MIB) is defined as: 



20 20 



M/B(i^,L) = ^^p(iC', Li) log 



P{K',V) 

Pb{Ki)-pb[Dy 



Similarly, if p{lC, L' ) = Q, then the value of the mono- 
t)(K^ 1 )\ 

mial p{K\ V) loe ■. — '- — is considered as 0. 

In addition, we also normalize MIB(K, L) by H(K, L) 
and define a new measure MIB', 



MIB'[K,L) 



MIB{K, L) 



H[K,L) 

MIB' is sometimes referred to as the normalized MIB. 

The covariation of physicochemical properties 

To measure the covariation of amino acid physicochem- 
ical properties, amino acids are grouped into ten over- 
lapping groups as suggested by Taylor [37]. The ten 
overlapping groups are: hydrophobic (A, G, C, T, I, V, 
L, K, H, F, Y, W, M), aromatic (F, Y, W, H), aliphatic (I, 
V, L), tiny (A, S, G, C), small (P, N, D, T, C, A, G, S, V), 
proline (P), charged (K, H, R, D, E), negative (D, E), 
polar (N, Q, S, D, E, C, T, K, R, H, Y, W) and positive 
(K, H, R). In this section, we use g_pair(a,b) to denote 
the group pair, in which the former residue belongs to 
group a and the latter residue belongs to group b. 

Then, the fractional frequency of group a in column K 
is defined as: 



An = 



count (iC") 



where count(/<") is the number of residues belongs to 
group a in column K. 

The joint property distribution of c_pair (K,L) for 
g_pair(a,b) is defined as: 

, countCi^M") 



PpiKM") 



N 



where the count{/<^, L*) is the number of residue-pairs 
belonging to g_pair(a,b). 
We define a new measure of covariation as: 

If the Pp(K", L ) = 0, the value of the monomial 
£) (K" L*"! 

p J K'',L'') loe— ^ — '- — -r- is 0. If one column of the 

column-pair is fully conserved, the value is also 0. The 
measure is referred to as the MI of physicochemical 
properties (MIP) and the normalized MIP is defined as: 



MIP'[K,L) 



MIP[K, L) 
Hp(K,L) ' 



10 10 

Where Hp{K, L) = Pp^^"' ^''^ logPp^^'' ^''^ 

a=l 6=1 

Moreover, amino acid physicochemical properties are 
also incorporated to the MIB model. The modified mar- 
ginal fractional frequency of group a in column K is 
defined as: 



PbpiK'-) 



PJKI 

n til 
m=l Hp 



Here, qj, is the background distribution of physico- 
chemical properties which is based on the BLOSUM62 
amino acid background distribution. 

Similarly, the covariation of the c_pair(K,L) is calcu- 
lated as: 



10 10 



M7BP(iC,L) = ^^Pp(r,L'')log 



fl=i ii=i 



PbpiK") ■ PbpiL")- 



If Pp{K", L^) = 0, the value of the monomial 
p (K" L^) 

Pn(K'^,L^)los, ;— is considered as 0. The 

' ' ^Ppm-pp{Lb) 

measure is referred to as the MIP with the physico- 
chemical properties background distribution (MIBP) in 
the paper. And the normalized MIBP is defined as: 



MIBP'{K,L) 



MIP(K, L) 
Hp[K,L) ■ 



The measures MI', MIB', MIP' and MIBP' are used to 
quantify covariation in the paper. 

conn (k) Is a more robust indicator than individual 
covariation score 

It has been proved that individual MI values may be 
misleading [39]. In order to improve individual scores. 
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[23] introduced an indicator conn(k) to characterize 
individual residues. In the paper, the top 75 high-scoring 
pairs are taken into account. And the cut-off of conn(k) 
is 5 in MI', MIB' and a comparing method ELSC. While 
in MIP' and MIBP', since there are ten groups, we take 
the top 25 high-scoring pairs and the cut-off of conn(k) 
is 3. Throughout this paper, we use co««('A^-name to 
denote the conn(k) score of the 'name' method. For 
example, conn(k)-MV denotes the conn(k) score of MI'. 

Results and Discussion 

In order to compare our new methods with existing 
ones, all chosen methods, namely MI', H2r, ELSC, MIB', 
MIP' and MIBP' are tested on four MSAs. The first one 
is a toy MSA is shown in Table 1 with 6 sequences and 
each sequence has 6 residues. Others are commonly 
used protein families in comparing coevolution meth- 
ods, which include IJXA-A, 1B93-A [28] and PF01053 
protein family. We use IQGN to denote the related pro- 
tein sequence of PF01053. Results on these MSAs are 
shown in Tables 2, 3, 4 and 5 and Figures 1, 2, 3, 4, 5 
and 6[40]. However, ELSC is not applied to PF01053, 
for the method suffers an arithmetic overflow when 
sequence number is too large. 

Comparison of different methods on a toy MSA 

In order to illustrate the effects of new biological con- 
straints, we generate a toy MSA, which is shown in 
Table 1. In the MSA, column 1 and 2 are fully con- 
served which are used to illustrate the effect of the 
amino acid background distribution; column 3, 4 and 5 
are used to illustrate the differences between the MIP' 
and MIBP' methods; and column 6 is randomly gener- 
ated. We then ran and compared all the methods on the 
toy MSA, and the results are listed in Table 2. For con- 
venience, a pair of columns A and B is represented by 
c_pair(A,B) in the study. 

In this section, MI' and MIB' are compared to show 
the effect of incorporating amino acid background dis- 
tribution. For column-pairs with one column fully con- 
served, such as c_pair(l,3), c_pair(l,4) and c_pair(l,5), 
the MT scores are 0, while MIB' can distinguish them 



Table 1 A toy MSA used to illustrate the differences 
between these referred methods 
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T 


D 


A 


Y 


C 


M 


T 



In the table, Ci denotes the /th column. 



with different scores 0.138, 0.237 and 0.141, respectively. 
When the first column is fully conserved, the joint fre- 
quency of c_pair(l,K) is equal to the observed frequency 
of the residue represented by column K. On the other 
hand, the observed distribution of residues in column K 
is used as the marginal distribution, therefore the MI' 
score of c_pair(l,K) is 0. However for MIB', the mar- 
ginal distribution is the observed distribution modified 
by the amino acid background distribution, thus, the 
MIB' scores of these column-pairs are distinguishable. 
For example, the MI' score of c_pair(l,6) is 0 while it is 
0.003 for MIB'. Although the difference between 0 and 
0.003 is not significant, it reveals that MI' and MIB' are 
essentially different. In general, the MI' scores of col- 
umn-pairs with one fully conserved column are 0, while 
the MIB' method can distinguish them unless both col- 
umns are conserved. This is because that the marginal 
distribution used in the MIB' methods has been modi- 
fied by the amino acid background distribution. In addi- 
tion, the columns which are paired with the same 
column and get equal MT scores, such as the c_pair 
(3,6), c_pair(4,6) and c_pair(5,6), are also distinguishable 
in MIB'. Although the distributions of column 3, 4 and 
5 are the same in number, the type of amino acids are 
different. The result uncovers that the amino acid back- 
ground distribution is meaningful in measuring the rela- 
tionship between columns. For column-pairs with fully 
covariant columns, the MIB' scores might be different. 
For example the MIB' scores for c_pair(3,4), c_pair(3,5) 
and c_pair(4,S) are 1.375, 1.279 and 1.378, respectively. 

Different from methods based on amino acid frequen- 
cies, the MIP' measure is based on amino acids physico- 
chemical properties. In detail, the MIP' values of c_pair 
(4,5), c_pair(4,6) and c_pair(S,6) are 0.077, 0.037 and 
0.053, respectively, while the MI' scores are 0.296. Table 

2 shows that the covariation of physicochemical proper- 
ties of c_pair(4,5) is stronger than that of c_pair(4,6) and 
c_pair(5,6), while they are equivalent by MI'. Similar 
results can be found in the comparison of MIP' and 
MIBP'. Table 2 also shows that the MIP' score is 0 if one 
column in the pair is fully conserved while this scenario 
doesn't appear under the MIBP' measure. It worths not- 
ing that the score of c_pair(l,2) is 0 under the MI', MIB' 
and MIP' measures since these two columns are fully 
conserved. However, MIBP' can estimate the covariation 
of them and gives a score 33.425 to the column-pair. 
Another interesting observation is that: although column 

3 is not fully conserved, the ten overlapping classifier can 
not distinguish W and Y. Thus the column with the fully 
conserved properties leads to a MIP' score of c_pair(3,K) 
0. So do G and A, I and L [30]. In contrast, under the 
MIBP' measure, these pairs can be distinguished due to 
the modification of marginal distribution by the back- 
ground distribution of physicochemical properties. 
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Table 2 The Ml', MIB', MIP' MIBP' and ELSC scores based on the toy MSA 



Ml' MIB' 
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0.296 






0 
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0.360 
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1 


0.296 








0 
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0 
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36.015 
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3.655 
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0 


0 
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25.528 
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4.642 


C3 
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0 


0 
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27.707 


3.387 
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C4 
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0.037 








0 
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4476 


C5 










0 


0.053 










0 


1.854 


C6 












0 












0 



ELSC 
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0 
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0.69 






C4 


0 


2.20 






0.69 






C5 




0 






0.69 






C6 










0 






In the table, O denotes the fth column. 
















The effect of incorporating amino acid background 
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In order to test the effect of incorporating amino acid 
















background distribution, MIB' is compared with MI' 
















and H2r on the PF01053 protein family. We plot the 


Table 3 Comparisons of the H2r, Ml', MIB', MIP' and 


relatively high conn(k)-MlB' scored sites in Figure 5 and 


MIBP methods based on the PF01053 family 




list the detailed information in Table 3. Table 3 shows 
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Ml' 




MIB' 


MIP' 


MIBP' 


that the differences among MIB', MI' and H2r are 
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obvious. For example, the highest conn(k)-MlB' site is 
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site 389, while that of MI' and H2r is site 388. 


305 6 272 


8 
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236 
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261 8 


In addition, the number of the sites with conn(k)- 


276 6 313 


8 


393 


8 


261 


5 


313 6 


Mr>= 5 is greater than that of MIB' and H2r. Moreover, 


307 5 207 
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391 
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313 


4 


272 6 


site 389, a functional important site, gets the highest 


389 5 138 


7 


268 


7 


423 


4 


423 5 


conn (k) -MIE' , but its conn(k)-H2r value is less than 5. It 


285 4 393 
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386 


7 


272 


4 


265 5 


is already known that in the models of the substrate 


386 4 236 
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305 
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427 


4 


207 5 


tCGS (cystathionine /-synthase from Nicotiana taha- 


389 4 268 


6 


280 


6 


207 


4 


427 4 


cum) complex (IQGN) [41], the movement of 03' from 


265 


6 


402 


5 








a mainly hydrophobic environment, arounds site 389 


261 


6 


308 


5 








(Phe). Furthermore, sites 239 (Phe) and 211 (Asn), 


391 


6 


211 


5 








which are essential for maintaining functional environ- 


389 


5 












ment, are ranked as the first and third sites interlinked 


386 


5 












with site 389 by the MIB' method. As shown in Figure 


305 


5 












2, sites, which get have relatively high MIB' scores with 


108 


5 












site 389, form a network around it. It demonstrates that 


211 


5 












strongly covariation sites surround important functional 


In the table, k denotes the /rth site of the IQGN chain 
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Table 4 Comparisons of the H2r, ELSC, Ml', MIB', MIP' and MIBP' methods based on the 1JXA-A family 
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99 
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125 8 
26 7 

In the table, k denotes the ^h site of the IJXA-A chain. 



domains to make a compensatory effect to maintain the 
function of the protein, a conjecture being vaUdated by 
many case studies [5,13,39]. These case studies are also 
tested on the 1B93-A family and the general results are 
similar. 

Besides H2r and MI', MIB' was also compared with 
the ELSC method proposed by [26]. ELSC uses a pertur- 
bation-based algorithm for calculating explicit likelihood 
of subset covariance, however the background distribu- 
tion is also ignored. ELSC and MI' are similar since the 
background distribution is not taken into account for 
both methods. As shown in Table 4, the sites with high 
conn(k)-MlB' and those of ELSC are different. Only site 
329 is predicted by both methods. Besides, the sites with 
high conn(k)-MlB' scores are mostly in the N terminal 
while that of ELSC are dispersed. On the 1B93-A family 
the ELSC shows no significant advantage over other 
methods. Moreover, it suffers arithmetic overflow when 
sequence number of the MSA is too large. In general, 
these results suggest that incorporating amino acid 
background is very important in measuring coevolution. 

Coevolution of amino acid physicochemical properties 

In contrast to previous methods which consider amino 
acids as symbols in a uniformly diverse alphabet, MIP' 
and MIBP' try to account for amino acid physicochem- 
ical properties. As shown in Table 3, the sites predicted 
by MIP' and MIBP' are different to those by MI' and 
MIB'. The highest conn(k)-MI' site is site 388, while 
that of MIP' and MIBP' is site 265 and 236, respectively. 
In addition, some sites predicted by MIP' and MIBP' are 
also predicted by MI', but not by MIB'. These results 
demonstrate that the classification of amino acids and 
physicochemical properties are different in depicting the 
MSA. Moreover, site 236 and 261 in PF01053 family, 
site 504, 481, 485 and 603 in IJXA-A family, and site 19 



and 91 in 1B93-A family are catalytic residues. It means 
that there are physicochemical properties based net- 
works to maintain the catalytic environment or support 
the catalytic process. 

An in-depth analysis on PF01053 family is also given 
to show the performances of MIBP'. It has been proven 
that the carboxylate OD2 of site 236 stabilizes the posi- 
tively charged pyridine nitrogen of PLP [41]. Site 261 
takes its role in catalysis by tCGS named ping-pong 
mechanism in the first and final steps. They coevolve to 
maintain specific physicochemical environment stabiliza- 
tion in or around the active region. As shown in Table 
4, the high conn(k)-MlBP' scored sites are in the N- 
terminal of the sequence while that of the MIP' are in 
the C-terminal. It demonstrates that the results are 
affected by background distribution significantly. More- 
over, four of five catalytic residues are highly scored in 
the IJXA-A family and the related structure [42]. Cata- 
lytic sites are located in the N-terminal half of the 
domain at the carboxyl edge of the /J-sheet. And the 
sites with high conn(k)-MlBP' scores are in the isomer- 
ase domain of GlmS (248-608) which is responsible for 
the binding of Fru6P and its conversion to GlcN6P. 
However site 504 is separated to other high scoring sites 
in space. In [43], it was shown that residue 504 plays a 
key role in the sugar ring opening on a different poly- 
peptide chain. A Schiff base with residue 603 is formed 
and it is replaced by the incoming ammonia in the 
Schiff base. It also indicates that the mutation of residue 
603 from Lys to Arg results in a decreasing of the 
synthase activity and an increasing of the isomerase 
activity. Site 485 forms H bond with the hydroxyl 
groups of the sugar. 

To compare with the conservation information, the 
JSD method [28] is used to rank the degree of sites in 
the IJXA-A family at different identity thresholds and 
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Table 5 Comparisons of the H2r, Ml', MIB', MIP' and MIBP' methods based on the 1B93-A family 





H2r 




ELSC 




Ml' 




MIB' 




MIP' 




MIBP' 


k 


conn(k) 


k 


conn(k) 


k 


conn(k) 


k 


conn(k) 


k 


conn(k) 


k 


conn(k) 


90 


10 


103 


15 


90 


13 


90 


22 


42 


8 


91 


8 


29 


8 


29 


10 


103 


9 


29 


15 


67 


6 


23 


7 


110 


7 


1 10 


9 


1 29 


8 


109 


9 


131 


3 


19 


6 


103 


6 


90 


8 


67 


8 


104 


7 


16 


3 


48 


6 


130 


6 


78 


8 


132 


7 


67 


7 






45 


6 


109 


5 


71 


8 


110 


7 


93 


6 






70 


6 






99 


8 


71 


7 


97 


6 






69 


6 






109 


7 


130 


7 


71 


6 






123 


3 






104 


6 


131 


6 


103 


5 














89 


6 


29 


6 


110 


5 














67 


6 


104 


6 


99 


5 














97 


5 


109 


5 


















65 


5 


65 


5 















In the table, k denotes the kth site of the 1 B93-A chain. 



the related results are shown in the supplement material 
(Additional files 1, 2, 3 and 4). As shown in Table 6, the 
sites with high conn(k) scores are not highly scored by 
the JSD. Among the sites predicted by MI', only sites 86 
and 84 are in the top ten of the JSD rank. And in the 
MIB', the sites are less conserved than MI'. Moreover, 
for MIP' and MIBP', although the sites are relatively 
conserved, the JSD rank can not correspond to the conn 



(k) value. In general, the relevance of conservation and 
covariation is not so high, which demonstrates that the 
conservation and covariation information are relatively 
independent properties of proteins. 

Relations between different measures 

A comparison of the MI', MIB', MIP' and MIBP' mea- 
sures was given based on the MSA of PF01053 and the 




Figure 1 The highest scoring residues k of 1QGN interlinked with position 388 based on Ml'. These sites are plotted in space filling mode 
and the site 388 is colored in red. Others are colored in turn of green, yellow, magenta, cyan, orange, tint white and grey with the decrease of 
the Ml' values. 
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Figure 2 The highest scoring residues k of 1QGN interlinked with position 389 based on MIB' These sites are plotted in space filling 
mode and the site 389 is colored in red. Others are colored in turn of green, yellow, magenta, cyan, orange, tint white and grey with the 
decrease of the IVIIB' values. 




Figure 3 The highest scoring residues k of 1QGN Interlinked with position 265 based on MIP'. These sites are plotted in space filling 
mode and the site 265 is colored in red. Others are colored in turn of green, yellow, magenta, cyan, orange, tint white and grey with the 
decrease of the IVIIP' values. 

V J 
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Figure 5 The highest scored residues of 1QGN based on the conn(k)-IVllBP' scores. Residues of tlie 1QGN witin conn(k)-MlBP' scores >4 are 
plotted in space filling mode and labeled. The sites 236 and 261 with the highest score 8 are colored in red. In addition, they are catalytic sites 
of the chain. Other chosen residues are colored in turn of green, yellow and magenta with the decrease of the conn(k)-MlBP' scores. 
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Figure 6 The highest scored residues of 1JXA-A based on the conn(l<)-MIBP' scores Residues of the IJXA-A with conn(k)-M\BP' scores >3 
are plotted in space filling mode and labeled. The site 504 with the highest score 8 is colored in red. Other chosen residues are colored in turn 
of green, yellow, magenta and cyan with the decrease of the conn(k)-Ml{iP' scores. 



related protein structure is PDB ID: 1QGN[28]. In Fig- 
ures 1, 2, 3 and 4 we found that with different biological 
constraints, the highest scoring sites are different and 
the comutating sites are also different. In the MI' 
method, the highest scoring site is 388 while that of 
MIB', MIP' and MIBP' are 389, 265 and 236, respec- 
tively. On the other hand, to make certain differences 
between these measures, we used the Friedman test to 
judge whether the performance statistics for these 



measures are significantly different. The difference 
between the performances of two measures is called sta- 
tistically significant if the P-value estimated by the 
Friedman test with Bonferroni correction is less than 
0.05 [28]. 

To testify the validity of the amino acid background 
distribution, the MI' and MIB' measures were compared. 
As shown in Table 7, the P-value between MI' and MIB' 
is 6.82 X 10'*, thus there is significant difference 



Table 6 Comparison of conservation method and coevolution methods based on 1JXA-A family 



(k)-m' ranic 


JSD ranl< 


connf^-IVlIB' ranic 


JSD ranl( 


conn(k)-M\P' rank 


JSD ranl( 


conn(k)-M\BP' rank 


JSD rank 


32 


64 


313 


310 


87 


12 


504 


2 


29 


23 


351 


28 


78 


59 


350 


52 


73 


21 


503 


42 


84 


10 


481 


29 


84 


10 


329 


97 


86 


3 


485 


27 


86 


3 


331 


288 


73 


21 


598 


11 


87 


12 


539 


24 


99 


56 


396 


35 


78 


59 


502 


134 


29 


23 


354 


16 


27 


65 


502 


73 


123 


19 


403 


26 


123 


19 


254 


45 






603 


32 


99 


56 


332 


151 










125 


36 














26 


31 















The column of conn(k)-name rank represents the sites with conn(k) scores in 'name' method. And the sites are presented from the high conn(k) scores to low 
ones. The column of JSD rank represents the corresponding JSD rank of the site. 
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Table 7 The P-values between the Ml', MIB', MIP', MIBP' 
methods 



Measure 


Ml' 


MIB' 


MIP' 


MIBP' 


Ml' 


1 


6.82 X 1 0""* 


2.03 X 10'' 


7.7 X 1 0"^ 


MIB' 




1 


9.63 X 10"*' 


3.68 X ]0'^ 


MIP' 






1 


8.27 X 10"^ 


MIBP' 








1 



between them. It confirmed the conclusion of [25] that 
the biological constraints is meaningful in measuring 
covariation, and evolution pressures of amino acids are 
removed successfully by using their background distri- 
bution. To clarify differences between covariation of 
amino acids and covariation of amino acid physico- 
chemical properties, the MI' and MIP' measures were 
also compared. The P-value between the MI' and MIP' 
measures is 2.03 x 10' . The P-value indicates that the 
covariation of amino acids is significantly different to 
that of physicochemical properties. These results indi- 
cate that the covariation of physicochemical properties 
is a new aspect of biological information to detect 
coevolution. 

Effect of sequence identity threshold 

The effect of sequence identity thresholds in measuring 
coevolution is tested in this section. Sequence identity 
thresholds of 90%, 80%, 70% and 60% are chosen as 
examples on the IJXA-A protein family. The identity 
thresholds < 60% are ignored since the account of the 
number of sequences is less thanl25. Performances of 
MIB' on the protein family are shown in Table 8 and 
performances of other methods are shown in the sup- 
plement materials. Table 8 shows that sites change sig- 
nificantly with the decreasing of the identity thresholds 
and the numbers of the predicted sites are also 

Table 8 Performances of MIB' at different sequence 



identity 





90 




80 




70 




60 


k 


conn(k) 


k 


conn(k) 


k 


conn(k) 


k 


conn(k) 


313 


12 


313 


16 


313 


12 


238 


24 


351 


8 


400 


6 


238 


11 


332 


8 


503 


8 


332 


6 


502 


8 


331 


7 


329 


7 


502 


6 


331 


6 






331 


6 


329 


6 


332 


5 






539 


6 






329 


5 







502 6 

602 5 

254 5 

332 5 



The MIB' method is tested on the IJXA-A protein family at sequence identities 
90%, 80%, 70% and 60%. 



decreasing. For example, site 313 is the highest conn(k)- 
MIB' scoring site at the identity thresholds of 90%, 80% 
and 70%, while it is not predicted at the identity thresh- 
old of 60%. And at relatively high identity thresholds, 
site 238 is not predicted while its conn(k)-Ml'B' score is 
high at relatively low identity thresholds. Moreover, 
there are only 3 sites left at the identity threshold 60%. 
It means these three sites coevolve with many columns 
which do not coevolve with other columns strongly. 
While at 90% threshold, the information is enough to 
reveal the covariation of those columns. Thus, it demon- 
strates that the lower identity threshold affects the cov- 
ariation significantly. It may lead to the consequence 
that some sites without coevolved property are mista- 
kenly regard as the highly coevolved sites. In contrast, 
some sites with coevolved property are not predicted 
since the signal of coevolution reduces with the lower 
identity. It is true that sequences with high identity 
would reduce the correctness of the MSA, thus identity 
threshold of 95% or 90% is acceptable. The related 
results are submitted with the paper. 

Conclusions 

In this study, we propose two new biological constraints, 
based on which several new measures are designed to 
detect protein residue coevolution. The first constraint 
is amino acid background distribution which is used to 
develop the MIB' method. In contrast to previous meth- 
ods which focused on how to transform joint count to 
joint frequency, MIB' method tries to remove the effects 
of amino acids evolutional pressures in the measure. By 
incorporating the new biological constraints, we found 
that MIB' is more effective in measuring amino acid 
coevolution. The second constraint is the physicochem- 
ical properties of amino acids which are used in the 
MIP' method. Motivated by the MIB' method, MIBP' 
method which removes the physicochemical properties' 
evolutional pressures is also proposed. Results show that 
the MIBP' method is sensible to catalytic sites. It indi- 
cates that physicochemical properties of residues around 
catalytic sites are strongly evolved. Moreover, results 
show that the MIBP' measure is significantly different 
from methods based on amino acid distribution. Thus 
the covariation of physicochemical properties supplies 
new coevolution information. 

Additional material 



Additional file 1: Data sets. All MSAs used in the study are given in the 
file 'Additional-File-l.zip'. 

Additional file 2: Results. A zip file (Addtional-File-2.zip) contains al 
results on above data sets. 

Additional file 3: Perl Code. A zip file (Perl-codes.zip) contains all Perl 
codes used in the study. 
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Additional file 4: Effects of sequence identity The supplement 
material includes the comparisons of these proposed methods on 
different sequence identities based on the IJXA-A family. Their 
corresponding JSD ranks are also shown in the material. 
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